Some modeling with Favorite Movies from 2023-10-24.
Note: I downloaded the Google Sheet to an Excel .xlsx file.
When should you upgrade R to 4.3.2? Definitely not before you finish Quiz 2.
I wrote the quiz and sketch in R 4.3.1.
After that, it’s up to you. If you use version 4.3.1 or later for your Project B, I’m fine. I’ll start caring about 4.3.2 in 432.
You will probably want to update your packages either way.
I have completed the upgrade to 4.3.2 for these slides, as you can see in the Session Information.
Today’s Packages
library(here)
library(readxl)
library(janitor)
library(gt)
library(gtExtras)
library(mosaic)
library(patchwork)
library(naniar)
library(mice)
library(mitml)
library(car); library(GGally)
library(corrplot); library(ggmice) # new packages today
library(broom)
library(xfun)
library(tidyverse)

theme_set(theme_bw())
Ingesting the data
movie_raw <-
  read_excel(here("c25/data/movies_2023-10-24.xlsx"),
             na = c("", "NA")) |> # otherwise only "" is recognized
  clean_names() |>
  type.convert(as.is = FALSE) |> # convert all characters to factors
  mutate(film_id = as.character(film_id),
         film = as.character(film))

movies <- movie_raw |>
  select(film_id, imdb_pct10, fc_pctwins, rt_audiencescore,
         ebert, box_off_mult, budget, metascore, bw_rating,
         imdb_oscars, mentions, dr_love, gen_1, bacon_1, lang_1,
         drama, comedy, adventure, action, romance, fantasy,
         sci_fi, crime, thriller, animation, family, mystery,
         biography, music, horror, musical, war, history,
         sport, western, film)

dim(movies)
[1] 201 36
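To see what the `type.convert(as.is = FALSE)` step does, here is a minimal sketch on a toy data frame. The values below are made up for illustration and are not taken from the spreadsheet. It suggests why `film_id` has to be converted back to character afterwards.

```r
# Toy data standing in for the spreadsheet (hypothetical values, not the real file)
toy <- data.frame(
  film_id = c("1", "2", "3"),
  lang_1  = c("English", "Hindi", "English"),
  ebert   = c("3.5", NA, "4"),
  stringsAsFactors = FALSE
)

# as.is = FALSE: character columns that do not look like numbers become factors;
# number-like columns become integer or double
converted <- type.convert(toy, as.is = FALSE)

# film_id was read as an integer, but it is a label, so store it as text again
converted$film_id <- as.character(converted$film_id)

str(converted)
```

The same idea applies to `film` in the slides: after conversion it would be a factor with one level per movie, so it is turned back into a character variable.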
Quick Check of Ingest
summary(movies)
film_id imdb_pct10 fc_pctwins rt_audiencescore
Length:201 Min. : 3.80 Min. :24 Min. :28.00
Class :character 1st Qu.:11.60 1st Qu.:42 1st Qu.:76.00
Mode :character Median :15.60 Median :52 Median :86.00
Mean :17.53 Mean :51 Mean :81.91
3rd Qu.:22.20 3rd Qu.:60 3rd Qu.:92.00
Max. :55.00 Max. :79 Max. :98.00
ebert box_off_mult budget metascore
Min. :1.000 Min. : 0.0013 Min. : 200000 Min. : 9.00
1st Qu.:2.875 1st Qu.: 2.6000 1st Qu.: 12000000 1st Qu.: 61.00
Median :3.500 Median : 4.7000 Median : 30000000 Median : 72.00
Mean :3.190 Mean : 8.5418 Mean : 59242257 Mean : 71.35
3rd Qu.:4.000 3rd Qu.: 9.3000 3rd Qu.: 90000000 3rd Qu.: 84.00
Max. :4.000 Max. :73.7000 Max. :356000000 Max. :100.00
NA's :25 NA's :20 NA's :19 NA's :10
bw_rating imdb_oscars mentions dr_love gen_1
Min. :0.000 Min. : 0.0000 Min. :1.000 No :124 F: 45
1st Qu.:1.000 1st Qu.: 0.0000 1st Qu.:1.000 Yes: 77 M:156
Median :3.000 Median : 0.0000 Median :1.000
Mean :2.135 Mean : 0.9849 Mean :1.249
3rd Qu.:3.000 3rd Qu.: 1.0000 3rd Qu.:1.000
Max. :3.000 Max. :11.0000 Max. :6.000
NA's :8 NA's :2
bacon_1 lang_1 drama comedy
Min. :1.000 English :177 Min. :0.0000 Min. :0.0000
1st Qu.:2.000 Japanese: 7 1st Qu.:0.0000 1st Qu.:0.0000
Median :2.000 Hindi : 5 Median :1.0000 Median :0.0000
Mean :1.886 Italian : 2 Mean :0.5721 Mean :0.3582
3rd Qu.:2.000 Arabic : 1 3rd Qu.:1.0000 3rd Qu.:1.0000
Max. :3.000 ASL : 1 Max. :1.0000 Max. :1.0000
(Other) : 8
adventure action romance fantasy
Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
Median :0.0000 Median :0.0000 Median :0.0000 Median :0.0000
Mean :0.3333 Mean :0.2537 Mean :0.1692 Mean :0.1393
3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:0.0000 3rd Qu.:0.0000
Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
sci_fi crime thriller animation
Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.00000
1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.00000
Median :0.0000 Median :0.0000 Median :0.0000 Median :0.00000
Mean :0.1244 Mean :0.1045 Mean :0.1045 Mean :0.08955
3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:0.00000
Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.00000
family mystery biography music
Min. :0.00000 Min. :0.0000 Min. :0.00000 Min. :0.00000
1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.00000
Median :0.00000 Median :0.0000 Median :0.00000 Median :0.00000
Mean :0.08955 Mean :0.0597 Mean :0.05473 Mean :0.05473
3rd Qu.:0.00000 3rd Qu.:0.0000 3rd Qu.:0.00000 3rd Qu.:0.00000
Max. :1.00000 Max. :1.0000 Max. :1.00000 Max. :1.00000
horror musical war history
Min. :0.0000 Min. :0.00000 Min. :0.0000 Min. :0.00000
1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.00000
Median :0.0000 Median :0.00000 Median :0.0000 Median :0.00000
Mean :0.0398 Mean :0.02985 Mean :0.0199 Mean :0.01493
3rd Qu.:0.0000 3rd Qu.:0.00000 3rd Qu.:0.0000 3rd Qu.:0.00000
Max. :1.0000 Max. :1.00000 Max. :1.0000 Max. :1.00000
sport western film
Min. :0.00000 Min. :0.000000 Length:201
1st Qu.:0.00000 1st Qu.:0.000000 Class :character
Median :0.00000 Median :0.000000 Mode :character
Mean :0.01493 Mean :0.004975
3rd Qu.:0.00000 3rd Qu.:0.000000
Max. :1.00000 Max. :1.000000
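The `NA's` rows in the summary are easy to miss in a table this long. As a rough sketch in base R, using a small toy data frame with made-up values, the missing counts per variable and the number of complete rows can be pulled out directly. With the real data, the same calls would be applied to `movies`.

```r
# Toy data with some missing values (hypothetical, for illustration only)
toy <- data.frame(
  ebert     = c(3.5, NA, 4, NA),
  metascore = c(72, 61, NA, 84),
  drama     = c(1, 0, 1, 1)
)

# Number of missing values in each column
n_missing <- colSums(is.na(toy))
n_missing

# Number of rows with no missing values at all
n_complete <- sum(complete.cases(toy))
n_complete
```

The naniar package loaded earlier offers `miss_var_summary()` for a similar per-variable count in a tidy format.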
Data Cleaning
Let’s rescale budget so it is expressed in millions of US dollars.
lang_eng should be 1 for English (n = 177) and 0 for non-English.
movies <- movies |>
  mutate(budget = budget / 1000000,
         lang_eng = as.numeric(lang_1 == "English"))

favstats(~ budget, data = movies) |>
  gt() |>
  fmt_number(columns = mean:sd, decimals = 2)
min   Q1   median   Q3   max   mean    sd      n     missing
0.2   12   30       90   356   59.24   68.02   182   19
movies |>
  tabyl(lang_eng, lang_1) |>
  gt()
lang_eng   Arabic   ASL   Bengali   Danish   English   French   German   Hindi   Italian   Japanese   Mandarin   Norwegian   Persian   Spanish
0          1        1     1         1        0         1        1        5       2         7          1          1           1         1
1          0        0     0         0        177       0        0        0       0         0          0          0           0         0
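The cleaning logic and the cross-tabulation check can be tried on a small scale first. This is a base R sketch on toy data with hypothetical values. The name `budget_m` is introduced here only for illustration, whereas the slides overwrite `budget` in place.

```r
# Toy data (hypothetical values, not from the movies file)
toy <- data.frame(
  lang_1 = c("English", "Japanese", "English", "Hindi"),
  budget = c(30000000, 12000000, NA, 200000)
)

# Rescale dollars to millions of dollars; missing budgets stay missing
toy$budget_m <- toy$budget / 1e6

# 1 if the primary language is English, 0 otherwise
toy$lang_eng <- as.numeric(toy$lang_1 == "English")

# Sanity check: every English film should land in the lang_eng = 1 row,
# and every other language in the lang_eng = 0 row
check <- table(toy$lang_eng, toy$lang_1)
check
```

If the indicator is coded correctly, each language column has a nonzero count in exactly one row, which is the pattern in the table above.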
Which outcome shall we choose?
We’re interested in a percentage measure (0-100) addressing how beloved the movie is, according to an audience.
Variable           NA   Description
imdb_pct10         0    % of 10-star public ratings in IMDB as of 2023-09
fc_pctwins         0    % of matchups won on Flickchart as of 2023-10
rt_audiencescore   0    Rotten Tomatoes Audience Score (% Fresh) as of 2023-10
top five genres: drama, comedy, adventure, action, romance
How many predictors can we use?
If we fit a linear regression model to at most 201 observations (remember, some variables have missing values), how many predictors can we realistically include?
A useful starting rule when you’re not doing variable selection: you need at least 15 observations for each coefficient you will estimate, including the intercept.
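The arithmetic behind this rule of thumb can be sketched as follows. The helper `max_coefficients()` is a hypothetical name used only for this illustration, and the sketch assumes each predictor costs one coefficient (a quantitative or binary predictor entered linearly).

```r
# Hypothetical helper: largest number of coefficients (including the intercept)
# supported by n observations under an "at least 15 observations each" rule
max_coefficients <- function(n, per_coef = 15) {
  floor(n / per_coef)
}

# All 201 films available
max_coefficients(201)       # 13 coefficients
max_coefficients(201) - 1   # 12 predictors after setting aside the intercept

# If a complete-case analysis dropped the 25 films missing ebert: 201 - 25 = 176
max_coefficients(176)       # 11 coefficients
max_coefficients(176) - 1   # 10 predictors
```

A multi-categorical predictor would use more than one coefficient, so the number of usable variables can be smaller than these counts.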
film_id fc_pctwins imdb_pct10 rt_audiencescore
Length:201 Min. :24 Min. : 3.80 Min. :28.00
Class :character 1st Qu.:42 1st Qu.:11.60 1st Qu.:76.00
Mode :character Median :52 Median :15.60 Median :86.00
Mean :51 Mean :17.53 Mean :81.91
3rd Qu.:60 3rd Qu.:22.20 3rd Qu.:92.00
Max. :79 Max. :55.00 Max. :98.00
box_off_mult metascore imdb_oscars bw_rating
Min. : 0.0013 Min. : 9.00 Min. : 0.0000 Min. :0.000
1st Qu.: 2.8000 1st Qu.: 61.00 1st Qu.: 0.0000 1st Qu.:1.000
Median : 4.7000 Median : 71.00 Median : 0.0000 Median :3.000
Mean : 8.5436 Mean : 71.11 Mean : 0.9751 Mean :2.124
3rd Qu.: 9.6000 3rd Qu.: 83.00 3rd Qu.: 1.0000 3rd Qu.:3.000
Max. :73.7000 Max. :100.00 Max. :11.0000 Max. :3.000
lang_eng drama comedy film
Min. :0.0000 Min. :0.0000 Min. :0.0000 Length:201
1st Qu.:1.0000 1st Qu.:0.0000 1st Qu.:0.0000 Class :character
Median :1.0000 Median :1.0000 Median :0.0000 Mode :character
Mean :0.8806 Mean :0.5721 Mean :0.3582
3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000
Max. :1.0000 Max. :1.0000 Max. :1.0000